Some thoughts about compression

Written By TAD

After watching the online seminars from BP06 (Breakpoint 2006) - from the likes of Chaos (FR), kb (FR), the highly entertaining "happy hardcore crash course for coders" by skrebbel (skrebbel, you should give up the demoscene and try full-time comedy!!), the Conspiracy intro-system, size coding by ryg (FR), the demoscene.tv seminar by Gargaj and (in my very, very humble opinion one of the best) "ideas and techniques for smaller executables" by muhmac (aka Marcus Winter) - I started re-reading many compression articles and looking into what the current 'state of the art' is.

What is surprising is that there is no single, universal compression algorithm; in fact most of the top archiver packages are hybrids of different compression techniques. It was once true that a single compression scheme was enough; it was universal and was used for everything - code, data, sound, models, etc. Back then the limitations of the hardware (cost, power and storage space) prevented many techniques from being attempted. In the field of compression this was true for statistical modelling, but as we have all seen the price of memory drop and sheer CPU MHz skyrocket, the once forbidden algorithms are not only possible but vital for obtaining some of the best compression ratios around. Look at LZMA (an LZSS/LZ77-style dictionary coder combined with Markov-style context modelling and a range/arithmetic coder - as far as I can figure out); while there are surprisingly few descriptions of the actual algorithm around (or maybe I'm looking in the wrong search engines), it is clear that storage and CPU power make compression hybrids a new route of exploration. So the creator of a packer must be fluent in many algorithms and be able to combine them.
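
To make the 'hybrid' idea a little more concrete, here is a minimal sketch (in C) of the dictionary-coding stage that something like LZMA builds on: a greedy LZSS-style parser that chops the input into literals and (offset, length) matches. Everything here - the window size, the minimum match length, the printf output - is made up for illustration; a real hybrid would model these symbols with contexts and push them through a range/arithmetic coder instead of printing them.

  /* Minimal greedy LZSS-style match finder (sketch only).
     Emits either a literal byte or an (offset, length) pair; a real
     hybrid like LZMA would model these symbols with contexts and feed
     them to a range/arithmetic coder instead of printing them. */
  #include <stdio.h>
  #include <string.h>

  #define WINDOW  4096   /* how far back we search for matches        */
  #define MIN_LEN 3      /* shorter matches cost more than literals   */
  #define MAX_LEN 255

  static void lzss_parse(const unsigned char *buf, size_t n)
  {
      size_t pos = 0;
      while (pos < n) {
          size_t best_len = 0, best_off = 0;
          size_t start = (pos > WINDOW) ? pos - WINDOW : 0;

          for (size_t cand = start; cand < pos; cand++) {
              size_t len = 0;
              while (len < MAX_LEN && pos + len < n &&
                     buf[cand + len] == buf[pos + len])
                  len++;
              if (len > best_len) { best_len = len; best_off = pos - cand; }
          }

          if (best_len >= MIN_LEN) {
              printf("match   off=%zu len=%zu\n", best_off, best_len);
              pos += best_len;
          } else {
              printf("literal 0x%02x\n", buf[pos]);
              pos++;
          }
      }
  }

  int main(void)
  {
      const char *s = "the quick brown fox jumps over the quick brown dog";
      lzss_parse((const unsigned char *)s, strlen(s));
      return 0;
  }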

Another unexpected discovery was how much effort and research is being directed towards pattern matching, in the form of multimedia data filters and algorithmic recognition. By this I mean most of the top archivers use different algorithms and/or shaping parameters to tweak their multi-approach packers towards the input data itself. When you think about it, this is a logical step in compression. Now that the universal compression algorithms are maxed out in terms of what they are able to deliver, the next step is to adapt not only the data model to the input data but also the algorithm itself. Of course this last adaptation is really just a selection from a pre-determined set of different approaches.
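
As a toy example of 'adapting the algorithm to the data', the sketch below tries a couple of reversible filters on a block (here just: no filter versus a byte-wise delta) and keeps whichever leaves the data with the lowest order-0 entropy. The filters and the entropy heuristic are my own simplification; real archivers detect x86 code, audio, text and so on with far cleverer heuristics, but the selection principle is the same.

  /* Crude "adapt the algorithm to the data" sketch: try a couple of
     reversible transforms and keep whichever leaves the block with the
     lowest order-0 entropy. */
  #include <math.h>
  #include <stdio.h>

  static double order0_entropy(const unsigned char *p, size_t n)
  {
      size_t freq[256] = {0};
      double bits = 0.0;
      for (size_t i = 0; i < n; i++) freq[p[i]]++;
      for (int s = 0; s < 256; s++) {
          if (!freq[s]) continue;
          double prob = (double)freq[s] / (double)n;
          bits -= prob * log2(prob);
      }
      return bits;                      /* average bits per byte */
  }

  /* transform 0: leave the data alone; transform 1: byte-wise delta,
     which often helps slowly changing data such as 8-bit audio */
  static void delta_filter(const unsigned char *in, unsigned char *out, size_t n)
  {
      unsigned char prev = 0;
      for (size_t i = 0; i < n; i++) { out[i] = (unsigned char)(in[i] - prev); prev = in[i]; }
  }

  int main(void)
  {
      unsigned char ramp[1024], filtered[1024];
      for (size_t i = 0; i < sizeof ramp; i++)
          ramp[i] = (unsigned char)(i / 4);       /* smooth ramp, delta-friendly */

      delta_filter(ramp, filtered, sizeof ramp);

      double e_raw   = order0_entropy(ramp,     sizeof ramp);
      double e_delta = order0_entropy(filtered, sizeof ramp);

      printf("raw   : %.2f bits/byte\n", e_raw);
      printf("delta : %.2f bits/byte\n", e_delta);
      printf("chosen filter: %s\n", (e_delta < e_raw) ? "delta" : "none");
      return 0;
  }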

The next step would of course be to create custom-built algorithms for each and every data file, but this creates practical problems, such as how and where such an algorithm should be stored. Would there be any data, or just an algorithm to re-create the original data file? There have been many attempts at a 'magic number generator' to compress data but sadly none have succeeded. It would be nice to have a small 50-100 byte routine which would re-create a 1 Meg file perfectly, but time constraints prohibit any reasonable attempt at this approach. Imagine having to run a routine to generate 1 Meg of data, compare its output against the original file and then modify the routine until it matches. Of course there is no guarantee that the output would ever match!
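
Just to put a number on why the brute-force 'generate and compare' idea is hopeless, here is a quick back-of-the-envelope calculation (the 10^9 tries per second figure is an arbitrary assumption):

  /* Back-of-the-envelope for the 'magic number generator' idea: even a
     50-byte routine has 256^50 = 2^400 possible byte patterns, so
     exhaustively searching them is out of the question. */
  #include <math.h>
  #include <stdio.h>

  int main(void)
  {
      double candidates_log2 = 50.0 * 8.0;                 /* 256^50 = 2^400 */
      double tested_per_sec  = 1e9;                        /* assumed speed  */
      double seconds_log2    = candidates_log2 - log2(tested_per_sec);
      double years_log2      = seconds_log2 - log2(3600.0 * 24.0 * 365.0);

      printf("candidate routines : 2^%.0f\n", candidates_log2);
      printf("worst-case search  : about 2^%.0f years at 10^9 tries/sec\n", years_log2);
      return 0;
  }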

We've all seen this idea in reverse. Look at the texture, model and sound generation techniques of any 256-byte to 64K demo/intro and you will see how small routines generate megabytes of data fairly quickly, but these are guided by human intervention - there is no automatic creation tool. When you think about these tools and techniques they are a 'lossy' form of compression; the artists don't care if a pixel on texture 329 at position 15,82 is incorrect so long as the overall texture looks and feels right. In fact the audience wants variation; the human brain craves interesting, non-repeating patterns that are semi-structured (random noise sucks, but so does the same texture on every object in the world). I'm sure that in the generation vs. compression field generation will always win, no matter how fancy your hybrid algorithm is. The simple fact is that time dictates what can and cannot be done. A generator might take a second to create a new, interesting texture or 3d shape, while an artist using 3D Studio Max (just look at the price!) will take an hour, then spend a few days trying to squeeze the data down to a size which will always be bigger than the parameters required by the generator approach.
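
As a tiny illustration of the generator-beats-compressor point, the toy below expands three floating-point parameters into a 256x256 greyscale texture (64 KB of pixels) using a made-up sine-plasma formula, and dumps it as a PGM image. Real intro tools chain dozens of such operators, but the size argument is the same: the 'source data' is a handful of numbers.

  /* Toy procedural texture: three parameters blow up into 64 KB of
     pixels.  The sine-plasma formula is made up for the sake of the
     example.  Redirect stdout to a .pgm file to view the result. */
  #include <math.h>
  #include <stdio.h>

  #define SIZE 256

  static unsigned char texture[SIZE][SIZE];

  static void generate(double fx, double fy, double warp)
  {
      for (int y = 0; y < SIZE; y++) {
          for (int x = 0; x < SIZE; x++) {
              double v = sin(x * fx) + sin(y * fy) +
                         sin((x + y) * warp) +
                         sin(sqrt((double)(x * x + y * y)) * 0.05);
              /* v is in [-4, 4]; remap to the 0..255 greyscale range */
              texture[y][x] = (unsigned char)((v + 4.0) * (255.0 / 8.0));
          }
      }
  }

  int main(void)
  {
      /* three doubles of 'source data' -> 65536 bytes of texture */
      generate(0.0313, 0.0471, 0.0119);

      printf("P5\n%d %d\n255\n", SIZE, SIZE);      /* binary PGM header */
      fwrite(texture, 1, sizeof texture, stdout);
      return 0;
  }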

I guess this is why intro/demo-systems are so popular right now. The advantage of an authoring package is time. Objects, movement and audio/visual elements can all be tweaked and replayed to near perfection.

Behind the scenes of any exe-packer are a few tricks used to make the executable compress better, such as turning relative offsets into absolute ones and/or op-code reordering, which splits multi-byte instructions into separate data streams. Another popular way to make code compress better is through the use of virtual machines, but for most people a simple exe-packer to turn a C++ monster into a petite code file is the most popular path.
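
The relative-to-absolute trick is worth a small sketch. The classic version is the x86 call filter: every E8 (call rel32) has its relative displacement rewritten as the absolute target address, so repeated calls to the same routine become identical byte strings for the LZ stage to match. The code below is only an illustration - it assumes little-endian byte order and happily treats any stray E8 data byte as a call, which real packers deal with more carefully.

  /* Sketch of the classic x86 call filter used by exe-packers: rewrite
     the rel32 operand of every E8 call as an absolute target address.
     The filter is reversed by subtracting the position again. */
  #include <stdint.h>
  #include <stdio.h>
  #include <string.h>

  static void call_filter(unsigned char *code, size_t n, int encode)
  {
      for (size_t i = 0; i + 5 <= n; i++) {
          if (code[i] != 0xE8)                   /* only touch 'call rel32' */
              continue;
          uint32_t disp;
          memcpy(&disp, code + i + 1, 4);        /* little-endian assumed   */
          uint32_t pos = (uint32_t)(i + 5);      /* address after the call  */
          disp = encode ? disp + pos             /* relative -> absolute    */
                        : disp - pos;            /* absolute -> relative    */
          memcpy(code + i + 1, &disp, 4);
          i += 4;                                /* skip the operand        */
      }
  }

  int main(void)
  {
      /* two calls to the same target from different places: before
         filtering the rel32 operands differ, afterwards they match */
      unsigned char code[] = {
          0xE8, 0x0B, 0x00, 0x00, 0x00,          /* call +0x0B (target 0x10) */
          0x90, 0x90, 0x90,
          0xE8, 0x03, 0x00, 0x00, 0x00,          /* call +0x03 (target 0x10) */
          0x90, 0x90, 0x90,
      };

      call_filter(code, sizeof code, 1);
      for (size_t i = 0; i < sizeof code; i++)
          printf("%02X ", code[i]);
      printf("\n");
      return 0;
  }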

What I haven't seen (yet) is a custom C/C++ compiler that produces highly compressible code through reduced register usage and a limited instruction set, or maybe this will be the next version of kkrunchy++ ;)

TAD 15 July 2006